QTM 447 Lecture 3: Generalization Error

Kevin McAlister

January 21, 2025

Generalization Error

Let’s start with a generic truth: the generalization error will always be greater than or equal to the training error

  • We know that this is true.

Proving this is true is a bit difficult.

But, we can build a proof by making an assumption beyond the i.i.d. assumption

Convenience Assumption - Fixed X

Let’s assume that we have two different i.i.d. data sets pulled from \mathcal T of size N, \{\mathbf X_D , \mathbf y_D\} and \{\mathbf X_D , \mathbf y_D'\}

The training features are the same

But the actual outcomes differ due to different noise draws

y = f(\mathbf x) + \epsilon \text{ ; } y' = f(\mathbf x) + \epsilon'

Generalization Error

Last time, we showed that:

\text{Generalization Error} = \text{Training Error} + \frac{2}{N} \sum \limits_{i = 1}^N Cov(y_i , \hat{y}_i)

where the covariance term is the covariance between the training outcomes and the predictions made at each \mathbf x_i from the training data.

  • For any good predictive model, we expect that these two things are positively correlated.

  • The higher the covariance between the predictions and the training outcomes, the larger the gap between generalization error and training error

Generalization Error

When might we expect the covariance between a prediction at \mathbf x and the observed outcome at that point to be high?

Note:

For all learning algorithms, the prediction at any point is a function of the training features and outcomes and the associated feature vector.

\hat{y}_0 = g(\mathbf x_0, \mathbf X , \mathbf y)

Let’s explore this with another example.

Generalization Error

Polynomial Regression with 1 feature:

y = \alpha + \sum \limits_{j = 1}^D \beta_j x^j + \epsilon

As the polynomial degree increases, the wiggliness of the function increases.

The feature matrix grows through basis expansion as the degree increases. For a third degree polynomial:

h(\mathbf x) = [1 , x , x^2, x^3]
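As a quick sketch (hypothetical inputs, numpy only), the basis expansion just stacks powers of x as columns:

```python
import numpy as np

def poly_basis(x, degree):
    """Expand a 1-D feature into columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

x = np.array([1.0, 2.0, 3.0])
H = poly_basis(x, 3)   # each row is h(x_i) = [1, x, x^2, x^3]
print(H.shape)         # (3, 4)
```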

Generalization Error

Let’s assume that we estimated the coefficients using ordinary least squares on the training data:

\hat{\boldsymbol \beta} = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y

So, the prediction at any point \mathbf x_0 becomes:

\hat{y}_0 = \mathbf x_0^T \hat{\boldsymbol \beta} = \mathbf x_0^T (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y

  • Each prediction depends on the vector of training outcomes!

  • The training features and value at which the prediction is made create weights for the training outcomes

  • The prediction at any point is a weighted combination of the training outcomes.
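One way to see the weights directly (a minimal numpy sketch with made-up training data): the row vector \mathbf x_0^T (\mathbf X^T \mathbf X)^{-1} \mathbf X^T contains the weight each training outcome receives in the prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20
x = rng.uniform(-2, 2, N)
X = np.vander(x, N=4, increasing=True)   # cubic basis: [1, x, x^2, x^3]
y = x**3 - x + rng.normal(0, 1, N)       # hypothetical noisy outcomes

x0 = np.array([1.0, 0.5, 0.25, 0.125])   # h(x) evaluated at x = 0.5
w = x0 @ np.linalg.solve(X.T @ X, X.T)   # weights on the training outcomes
y_hat = w @ y                            # prediction = weighted sum of the y_i
print(w.sum())                           # with an intercept, the weights sum to 1
```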

Generalization Error

We want to learn the covariance between the training outcomes and the predictions made for the training data

\sum \limits_{i = 1}^N Cov(y_i , \hat{y}_i)

where

\hat{\mathbf y} = \mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y

So, we’re trying to compute

Cov[\mathbf y , \mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y]

  • Under the fixed \mathbf X assumption, each outcome is the only random part of this expression

Generalization Error

Since covariance is a linear operator and Cov[\mathbf y, \mathbf y] = \sigma^2 \mathbf I_N under i.i.d. noise:

Cov[\mathbf y , \mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y] = \mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T Cov[\mathbf y,\mathbf y] = \sigma^2 \mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T

  • An N \times N covariance matrix with the self-covariances Cov(y_i, \hat{y}_i) along the main diagonal

Generalization Error

For any linear model, we can compute the sum of the self-covariances as:

\sigma^2 \text{tr}\left(\mathbf X(\mathbf X^T \mathbf X)^{-1} \mathbf X^T\right)

where \text{tr}() is the trace of the matrix (the sum of the diagonal elements)

  • Anyone know what the trace of this matrix is if \mathbf X is an N \times P matrix?

Generalization Error

The trace is invariant under cyclic permutations

\sigma^2 \text{tr}\left(\mathbf X(\mathbf X^T \mathbf X)^{-1} \mathbf X^T\right) = \sigma^2 \text{tr}\left((\mathbf X^T \mathbf X)^{-1} \mathbf X^T\mathbf X\right) = \sigma^2 \text{tr}\left(\mathbf I_P\right)

The sum of the self-covariances is then:

P \sigma^2

and the training error offset term is:

\frac{2P\sigma^2}{N}
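This is easy to check by simulation (a sketch with a hypothetical cubic signal, fixed \mathbf X, and repeated noise draws): the trace of the hat matrix equals P, and the summed covariances come out near P \sigma^2.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P, sigma = 50, 4, 2.0                  # cubic basis -> P = 4 columns
x = rng.uniform(-2, 2, N)                 # features are fixed across draws
X = np.vander(x, N=P, increasing=True)
f = x**3 - x                              # hypothetical true signal

H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix
print(np.trace(H))                        # -> 4.0 = P, by the cyclic trace trick

# Monte Carlo estimate of sum_i Cov(y_i, yhat_i) over repeated noise draws
draws = 10000
Y = f + rng.normal(0, sigma, (draws, N))  # each row is one draw of the outcomes
Yhat = Y @ H.T
cov_sum = ((Y - Y.mean(0)) * (Yhat - Yhat.mean(0))).sum() / (draws - 1)
print(cov_sum)                            # close to P * sigma^2 = 16
```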

Generalization Error

Let’s think about what this means:

  • As the number of implied features increases, the covariance between the training outcomes and the predictions increases

  • Put another way, the more implied features in the model, the more the prediction at \mathbf x depends on the value of the outcome at that point and the less it depends on the value of the outcome at surrounding points

  • For the polynomial model, as the degree increases, the number of implied features grows and the offset term grows with it.

  • This makes sense because we’re overresponding to wiggliness in the training data

  • Overfitting

  • This effect is diminished as N gets larger

Generalization Error

Three factors that affect generalization performance:

  • Training error

  • The covariance of training outcomes and predictions

  • The sample size

Lower the training error or the covariance (or increase the sample size) and the generalization error decreases!

Generalization Error

Generated from a 3rd degree polynomial with \sigma^2 = 9.

Generalization Error

This is great!

  • We understand a little about the link between training error and the generalization error

But, this result really only applies when we can compute the covariance of the prediction and the outcome

  • What is this covariance for an SVM with an RBF kernel?

  • What is this covariance for a random forest?

To make more headway here, we need to present more general results.

Mean Squared Error

We’ll make the most progress exploring our two common loss functions.

Let’s start with MSE for a continuous outcome.

Generalization error under mean squared error:

E_\mathcal T[(y - \hat{y})^2]

In our previous example, we made a fixed \mathbf X assumption

  • \mathbf X_D was always the same any time we took a draw of size N from \mathcal T

Let’s not make that assumption.

Mean Squared Error

Recall that \mathcal T is a multivariate distribution over the features and the outcomes. We can decompose \mathcal T into two components:

\mathcal T = P(\mathbf x) P(y | \mathbf x)

When we do not make the fixed \mathbf X assumption, there are two random components in the process assuming each draw is an i.i.d. draw from \mathcal T:

  • The training set - if we reached into \mathcal T and drew another training set of size N, we would expect to receive a slightly different training set (both features and outcomes)

  • The outcomes conditional on the feature draws - even if I drew the same \mathbf x, I would expect that this value differs depending on the values of the noise draws

Mean Squared Error

These two sources of randomness mean that we can split our expectation into two pieces:

E_\mathcal T[(y - \hat{y})^2] = E_{y | x} \left[ E_D \left[ (y - \hat{y})^2 \right] \right]

  • D is our training data. Note that the prediction only depends on the training data drawn. With a different draw of D from \mathcal T, our prediction could be a little different. The prediction made at \mathbf x given the observed training set is \hat{y}.

  • y is random only in terms of the error variance. Given a value of \mathbf x, y = f(\mathbf x) + \epsilon.

We can evaluate these expectations to get a workable form:

E_\mathcal T[(y - \hat{y})^2] = V_{y | x}[y] + V_D[\hat{y}] + (E_D[\hat{y}] - f(\mathbf x))^2

where f(\mathbf x) is the true value of the signal evaluated at \mathbf x (Derivation included at the end of the slides)
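The three terms can be approximated by brute force (a sketch: hypothetical cubic truth, a deliberately underfit quadratic model, many i.i.d. training sets, and a prediction at the single point x = 0.5):

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma, degree = 100, 3.0, 2            # fit a quadratic to a cubic truth
f = lambda x: x**3 - x                    # hypothetical true signal
x0 = 0.5                                  # point at which we predict

preds = []
for _ in range(2000):                     # many i.i.d. training sets
    x = rng.uniform(-2, 2, N)
    y = f(x) + rng.normal(0, sigma, N)
    coef = np.polyfit(x, y, degree)       # OLS fit of the polynomial
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)

bias_sq = (preds.mean() - f(x0))**2       # squared bias at x0
model_var = preds.var()                   # model variance at x0
irreducible = sigma**2                    # = V[y | x]
print(bias_sq, model_var, irreducible)
```

Because the quadratic cannot bend like the cubic, the squared bias dominates here; refitting with degree 3 would shrink it toward zero while slightly raising the model variance.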

Mean Squared Error

Let’s look at each of these terms separately.

Bias

(E_D[\hat{y}] - f(\mathbf x))^2

  • Over all possible i.i.d. training sets drawn from \mathcal T of size N, what is the difference between the prediction and the truth?

Two ways that bias can change:

  • The flexibility of the chosen model with respect to the true functional form

  • Estimation bias for any unknowns in the chosen learning method (more on this later)

Mean Squared Error

Let’s look back at our polynomial model with one feature. Generate a large number of i.i.d. training sets from the data generating distribution to see the different models we uncover with 1 polynomial term.

Mean Squared Error

We see that there is a little variation over different i.i.d. training sets of size 100. Let’s look at the average prediction over 1000 i.i.d. training sets.

Mean Squared Error

Repeated for a 2nd degree polynomial

Mean Squared Error

Repeated for a 3rd degree polynomial

Mean Squared Error

And a 10th degree polynomial

Mean Squared Error

Since we know the true data generating function, let’s compute (or at least closely approximate) the bias for each polynomial degree between 1 and 15

Mean Squared Error

When we underfit (polynomial degree lower than the true degree), the bias tends to be high

When we get the right model or overfit, the bias doesn’t decrease anymore!

Note that the bias is bounded above by the bias of the simplest possible model

  • Over a set of models, the bias can only be so bad.

Bias takeaways:

  • As the complexity of a model increases, the bias of that model w.r.t. generalization error decreases in a bounded way

Mean Squared Error

Model Variance

V_D[\hat{y}]

  • Over all possible i.i.d. training sets drawn from \mathcal T of size N, how variable are the predictions?
  • Variation in each training set of size N leads to slightly different “optimal” models (think slightly different regression coefficients) and slightly different predictions

Two ways that model variance can change:

  • The flexibility of the chosen model with respect to the true functional form

  • The size of the training sample

Mean Squared Error

Generate a large number of i.i.d. training sets from the data generating distribution to see the different models we uncover with different polynomial degrees.

Mean Squared Error

When the model is least complex, the model variance is its lowest

  • There isn’t much of a difference in where the best fit line goes.

As the complexity increases, the model variance increases

  • Any intuition as to why this is true? Think back to our self-correlation measure.

Model variance tends to increase in an unbounded way as the complexity increases.

Mean Squared Error

What happens as N changes?

Mean Squared Error

As N gets larger, the model variance tends to uniformly decrease across complexity

  • Any intuition as to why this is true?

Model variance takeaways:

  • As the model becomes more complex, the model variance increases

  • As we have a larger training set, the model variance decreases

Mean Squared Error

Irreducible Error

V_{y | x}[y]

  • Even if we knew the true data generating distribution w.r.t. \mathbf x, how variable would the outcomes be?

Just the cost of doing business:

  • We don’t have all of the features

  • We may not be able to uncover all of the functional variations that lead to the outcome

It’s just always there and cannot be reduced

  • Think residuals in a linear regression model where we know that we have the correct specification!

Mean Squared Error

This leads us to the common adage:

\text{Generalization Error} = \text{Irreducible Error} + \text{Bias}^2 + \text{Model Variance}

Mean Squared Error

Let C be a value that represents the complexity of the chosen model

  • C increases and the model becomes more complex - like the degree of the polynomial

We can use our general observations about MSE generalization error to make a statement:

\text{Generalization Error} = \sigma^2 + \mathcal O\left(\frac{1}{C}\right) + \mathcal O \left(\frac{C}{N}\right)

  • As C increases, bias decreases

  • As C increases, model variance increases

  • However, with large N, the model variance increase is attenuated and we can tolerate more flexibility!

Mean Squared Error

This leads us to a balance that must exist for good generalizable models:

  • If C is too high, the model variance overwhelms any reduction in bias (overfitting)

  • If C is too low, the bias overwhelms any reduction in model variance (underfitting)

This is referred to as the bias-variance tradeoff

  • Really, the bias-complexity tradeoff

There exists some optimal balance point in terms of complexity that is neither too complex nor too simple

  • Just right

0/1 Loss

The generalization under 0/1 loss is:

E_\mathcal T[I(y \neq \hat{y})]

  • For any test instance, we’re either right or we’re wrong

  • We would like to know over the entirety of the data generating distribution what is the probability that any one instance is incorrect.

This expectation doesn’t really have any super nice decompositions

  • It’s just indicator functions

0/1 Loss

This matters because we can’t do the same thing that we did for MSE to show how the generalization error is determined by the bias and variance of the model.

  • The bias and variance under 0/1 loss isn’t quite as meaningful of a value since the only possible outcomes are in a (usually small) set of discrete outcomes.

0/1 Loss

Are there any guarantees about generalization error for the 0/1 loss function?

Let’s assume that we’re working with a two-class problem and our goal is to compute the expected 0/1 loss for the true data generating process, \mathcal T.

  • This theorem applies to the multiclass problem, as well.

0/1 Loss

The Vapnik-Chervonenkis (VC) Bound

For a binary classification problem with a training set of size N, with probability 1 - \delta the following inequality on the 0/1 error holds:

\text{Generalization Error} \le \text{Training Error} + \sqrt{\frac{1}{N} \left[ d \left(\log \frac{2N}{d} + 1\right) - \log \frac{\delta}{4} \right]}

  • The proof to get this bound is gnarly

  • Relies on Hoeffding’s inequality, the concept of PAC learnability, and Sauer’s lemma

  • The interested student can find this proof in Chapter 6 of “Understanding Machine Learning: From Theory to Algorithms”

  • Note that N - the training size - and \delta - the probability of the guarantee - are fixed and known!
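A sketch of how the offset term behaves (pure numpy, simply plugging values into the bound with \delta = 0.05):

```python
import numpy as np

def vc_offset(N, d, delta=0.05):
    """The square-root penalty term of the VC bound."""
    return np.sqrt((d * (np.log(2 * N / d) + 1) - np.log(delta / 4)) / N)

for N in (100, 1000, 100000):
    print(N, vc_offset(N, d=3), vc_offset(N, d=50))
```

Note that at N = 100 and d = 50 the offset alone exceeds 1, a reminder that this is a worst-case bound and not a sharp estimate of the generalization error.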

0/1 Loss

The new variable here is d

  • Called the VC dimension of the learning algorithm

Formal definition:

For a learning method with hypothesis class \mathbf H, the VC dimension is defined as the size of the largest finite subset of the instance space, \Omega, shattered by \mathbf H

Shattering:

A classifier is said to shatter a set of d points if, for every one of the 2^d possible labelings of those points, the classifier can produce decision regions that label all d points correctly.

0/1 Loss

This doesn’t make a lot of sense on its own.

So, let’s draw some pictures!

  • Linear classifier in 2D feature space with 2,3, and 4 points

  • Quadratic classifier in 2D space with 4 points

0/1 Loss

The VC dimension is a statistically sound way of measuring the complexity of a classifier

  • Sort of like measuring the wiggliness of the classifier

Some methods have a finite VC dimension:

  • Any linear classifier using P features with a free intercept has a VC dimension of P + 1

  • Gradient boosted linear classifiers (with T parts of the ensemble) have a VC dimension of T(P + 1)(2 + 3 \log T(P + 1))


0/1 Loss

Other algorithms have an infinite VC dimension:

  • 1NN classifiers (are we guaranteed that the trained classifier ever actually generalizes?)

  • RBF SVMs

Could have infinitely bad generalization performance

  • But doesn’t because the VC bound is a worst-case bound

  • Complicated point, happy to discuss more in office hours

0/1 Loss

What happens to the training offset as d and N change?

0/1 Loss

Some observations:

  • This bound is not sharp - the 0/1 error can’t be greater than 1. The VC bound shouldn’t really be expected to give you a good measure of the actual value of the generalization error.

  • As the classifier becomes more expressive, the bound on the generalization error increases for fixed N.

  • This increase is offset as N increases.

0/1 Loss

Consider a family of functional forms that can range from a small finite VC dimensionality to an infinite VC dimensionality

Example: Polynomial Logistic Regression

2 features:

  • First order - [x_1,x_2]

  • Second order - [x_1,x_2,x_1^2,x_2^2,x_1x_2]

  • Third order - [x_1,x_2,x_1^2,x_2^2,x_1x_2,x_1^3,x_2^3,x_1x_2^2,x_1^2x_2]

0/1 Loss

We expect the training error to decrease as the VC dimension/polynomial order of the classifier increases

  • More expressive means better fit to the training data!

In theory, with enough polynomial terms, there is no training set that can’t be matched perfectly!

However, at a certain point, there exists some VC dimensionality where adding more complexity does not viably decrease the training error anymore

But, we do continue to increase the training offset!

  • Overfitting!

0/1 Loss

Learning algorithms that can range from simple to arbitrarily complex (indexed by tuning parameters) and that guarantee 0 training error with enough complexity are called universal approximators

  • We’ll come back to these in a week

0/1 Loss

This means that we can re-express the VC bound in terms of complexity, C:

\text{Generalization Error} \approx \mathcal O \left( \frac{1}{C} \right) + \mathcal O \left( \frac{C}{N} \right)

which looks a lot like the bias/variance tradeoff!
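Balancing the two terms gives a rough sense of where the optimum sits (a heuristic, treating the big-O terms as if they were exact with constants a and b):

```latex
\frac{d}{dC}\left(\frac{a}{C} + \frac{bC}{N}\right) = -\frac{a}{C^2} + \frac{b}{N} = 0
\quad \Longrightarrow \quad
C^\ast \propto \sqrt{N},
\qquad
\text{Generalization Error at } C^\ast \propto \frac{1}{\sqrt{N}}
```

So the tolerable complexity grows with the sample size, exactly the pattern we saw for MSE.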

0/1 Loss

Takeaway:

Even for classification, we need to worry about overfitting

  • Just because we’re trying to put things in a small number of buckets doesn’t mean that we can ignore overfitting!

Generalization Error

Overall takeaways:

  • The optimal model in terms of generalization occurs when complexity of the learner is balanced - not too simple, not too complex, just right

  • Overfitting occurs when we allow excess complexity that no longer outweighs any gains in generalization due to decrease in model bias

  • We can tolerate more complexity as N gets larger. For truly complex problems (hint hint deep learning), you’ll need a lot of training data

Measuring Generalization Error

All of this is theoretical - to give you an understanding of when we can tolerate a little more complexity

  • The big one: when N is incredibly large

However, none of these tools really give us a general way to estimate the generalization error

  • Just some knowledge of how it relates to training error, sample size, and model complexity

Measuring Generalization Error

Why do we want to measure generalization error for a specific model?

  1. We want to choose the value of a tuning parameter that will minimize the generalization error
  2. We want to honestly assess how well we would expect our learning method to perform in a true out of sample setting.

Measuring Generalization Error

The theoretical tool:

We can replicate evaluating the expectation over the true data generating distribution if we have a true heldout test set

  • As the size of the heldout data gets large, the average error on the test set converges to the generalization error!

  • LLN at work

The important rule:

The test set cannot be used to train the model in any sense to adequately approximate the generalization error

  • The moment we use data to train the model, the predictions become correlated with the features and outcomes

  • This leads the estimate to approach the training error, which we know is lower than the generalization error

Measuring Generalization Error

We also need an “out-of-sample” data set that can be used to choose the appropriate value for any tuning parameters

  • Example: K for a KNN classifier

This data set is used to assess something like the generalization error for different values of K and choose the one that is likely to give us the lowest generalization error!

But, once we use data to make this choice, we can no longer use it to adequately estimate the true generalization error!

Measuring Generalization Error

The gold standard:

Start with all of your data and split it into three pieces

  • A training set used to estimate model coefficients (like regression coefficients)

  • A validation set used to choose tuning parameters (like the regularization factor in LASSO or K in KNN)

  • A test set used to estimate the true expected loss for a new prediction made with the model that never touches the training procedure
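A minimal numpy-only sketch of the three-way split, using the running polynomial example (all data here is synthetic; the degrees and the 60/20/20 split fractions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, 1000)
y = x**3 - x + rng.normal(0, 3, 1000)        # hypothetical data

# 60/20/20 split into train / validation / test
idx = rng.permutation(1000)
tr, va, te = idx[:600], idx[600:800], idx[800:]

def val_mse(deg):
    coef = np.polyfit(x[tr], y[tr], deg)     # "coefficients" step: training set
    return np.mean((y[va] - np.polyval(coef, x[va]))**2)

best_deg = min(range(1, 11), key=val_mse)    # "tuning" step: validation set
coef = np.polyfit(x[tr], y[tr], best_deg)
test_mse = np.mean((y[te] - np.polyval(coef, x[te]))**2)  # honest estimate
print(best_deg, test_mse)
```

The test set is touched exactly once, after the degree has been chosen, so the final MSE remains an honest estimate of the generalization error.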

Measuring Generalization Error

An issue:

Each of the quantities we want to estimate using each of the three pieces will be more accurate the larger each split set is!

  • Bigger N means more accurate coefficients

  • Bigger N means a better estimate of the generalization error for choosing tuning parameters

  • Bigger N means a better estimate of the true generalization error

If the original training set isn’t really large, doing a three part split leads to poor estimates throughout!

  • A middle point is to use cross-validation methods

Measuring Generalization Error

A rule of thumb:

  • If your training data is really large, just create a three part split and use the same validation set each time to choose the values of any tuning parameters

  • If your training data is smaller, use K-fold cross validation methods to choose the values of any tuning parameters

Regardless of choice, you always want to set aside an adequately large true heldout test set to assess the generalization error for any final chosen model!

Measuring Generalization Error

You already know all of this stuff, so we’re not going to burn too much time on this point!

ML is all about generalization performance

  • So never make the mistake of choosing models based on training error

Next Class

We’re going to get more applied and discuss regression methods

  • Linear and Logistic Regression

  • Log Likelihoods and Equivalence to MSE/Probability Loss

  • Regularization and Penalized Methods for Better Generalization

Mean Squared Error

These two independent sources of randomness mean that we can split our expectation into two pieces:

E_\mathcal T[(y - \hat{y})^2] = E_{y | x} \left[ E_D \left[ (y - \hat{y})^2 \right] \right]

  • D is our training data. Note that the prediction only depends on the training data drawn. With a different draw of D from \mathcal T, our prediction could be a little different. The prediction made at \mathbf x given the observed training set is \hat{y}.

  • y is random only in terms of the error variance. Given a value of \mathbf x, y = f(\mathbf x) + \epsilon.

Mean Squared Error

A neat random variable rule. Suppose a is a random variable and b is a constant, then:

E_a[(a - b)^2] = V_a[a] + (E_a[a] - b)^2

  • Arises from the fact that V[a] = E[a^2] - E[a]^2
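Expanding the square makes the rule immediate:

```latex
E_a[(a-b)^2] = E_a[a^2] - 2b\,E_a[a] + b^2
            = \left(E_a[a^2] - E_a[a]^2\right) + \left(E_a[a]^2 - 2b\,E_a[a] + b^2\right)
            = V_a[a] + \left(E_a[a] - b\right)^2
```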

Mean Squared Error

Let’s start by applying the expectation over the training data:

E_\mathcal T[(y - \hat{y})^2] = E_{y | x} \left[ V_D[\hat{y}] + (E_D[\hat{y}] - y)^2 \right]

  • The true value of the outcome is not random with respect to the training data.

Mean Squared Error

Note that \hat{y} is not random with respect to the true value of y - the training set is assumed fixed with respect to the noise.

Now applying the other expectation:

E_\mathcal T[(y - \hat{y})^2] = V_D[\hat{y}] + V_{y | x}[y] + (E_D[\hat{y}] - E_{y | x}[y])^2

And finally assuming that the expected value of the noise draw is zero:

E_\mathcal T[(y - \hat{y})^2] = V_D[\hat{y}] + V_{y | x}[y] + (E_D[\hat{y}] - f(\mathbf x))^2

where f(\mathbf x) is the true function that maps the features to the outcome (e.g. the signal).